Abstract:
With the increased complexity of digital architectures and the aggregation of specialized hardware, functional simulation has become a major bottleneck in digital design. During functional and performance verification of a design, engineers make several iterations to determine the impact of code changes on the simulation result. These iterations are time-consuming, both because compiling a hardware description to a binary is slow and because simulation can take several hours to reach the point of interest. In contrast, live programming environments allow developers to manipulate the system under development while it is running. They have become increasingly popular because they provide rapid feedback, yet no live environment exists for hardware development. In this paper, we propose a live programming and simulation environment that targets hardware design. Our approach is language-independent and leverages incremental compilation, hot binary reloading, and checkpointing to provide fast feedback to the user. We take special care not to replicate code for multiple instances of the same module, preventing code bloat, for instance, for multi- and many-core architectures. Our framework also verifies consistency across checkpoints to leverage parallel execution and reduce the amount of code that requires compilation. Our results show that this approach can provide simulation feedback in under 2 seconds, even when simulating a 256-core RISC-V architecture. As a reference, Verilator did not finish compiling this architecture after 24 hours of runtime.
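The interplay of checkpointing and hot binary reloading described above can be sketched in a few lines. This is a hypothetical illustration of the technique, not the paper's implementation; the `LiveSim` class, the state layout, and the checkpoint interval are all made up:

```python
import copy

class LiveSim:
    """Toy live-simulation loop: checkpoint periodically, and after a code
    edit swap in the recompiled model and replay only from the nearest
    checkpoint instead of restarting at cycle 0."""

    def __init__(self, step_fn, interval=1000):
        self.step_fn = step_fn        # compiled per-cycle model (hot-swappable)
        self.interval = interval      # cycles between checkpoints
        self.checkpoints = []         # (cycle, deep-copied state)
        self.state = {"cycle": 0, "regs": {}}

    def run_to(self, target_cycle):
        while self.state["cycle"] < target_cycle:
            if self.state["cycle"] % self.interval == 0:
                self.checkpoints.append(
                    (self.state["cycle"], copy.deepcopy(self.state)))
            self.step_fn(self.state)
            self.state["cycle"] += 1

    def hot_reload(self, new_step_fn, target_cycle):
        # Swap in the recompiled model, rewind to the latest checkpoint at
        # or before the point of interest, and replay forward from there.
        self.step_fn = new_step_fn
        _, snap = max((c for c in self.checkpoints if c[0] <= target_cycle),
                      key=lambda c: c[0])
        self.state = copy.deepcopy(snap)
        self.run_to(target_cycle)
```

After an edit, only the cycles since the last checkpoint are replayed, which is what keeps the feedback latency bounded; a real implementation must also verify that the edit did not invalidate the checkpointed state.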
Abstract:
This paper describes the methodology and algorithms behind the extra pipeline analysis tools released in the Xilinx Vivado Design Suite version 2015.3. Extra pipelining is one of the most effective ways to improve the performance of FPGA applications. Manual pipelining, however, often requires significant effort from FPGA designers, who need to explore various changes in the RTL and re-run the flow iteratively. The automatic pipelining approach described in this paper, in contrast, allows FPGA users to explore latency vs. performance trade-offs of their designs before investing time and effort into modifying the RTL. We describe the algorithms behind these tools, which use simple cut heuristics to maximize performance improvement while minimizing additional latency and register overhead. To demonstrate the effectiveness of the proposed approach, we analyse a set of 93 commercial FPGA applications and IP blocks mapped to the Xilinx UltraScale+ and UltraScale generations of FPGAs. The results show that extra pipelining can provide from 18% to 29% potential Fmax improvement on average. They also show that the distribution of improvements is bimodal, with almost half of the benchmark designs showing no improvement due to the presence of large loops. Finally, we demonstrate that highly pipelined designs map well to the UltraScale+ and UltraScale FPGA architectures. Our approach demonstrates 19% and 20% Fmax improvement potential for the UltraScale+ and UltraScale architectures respectively, with the majority of applications reaching their loop limit through pipelining.
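The "loop limit" mentioned at the end follows from a simple bound: pipelining cannot add registers inside a feedback loop, so each cycle in the netlist caps the achievable frequency at (registers in the loop) / (combinational delay around the loop). A toy calculation with made-up delay numbers:

```python
def loop_limit_mhz(loops):
    """Each loop is (combinational delay around the loop in ns, register
    count in the loop); the tightest loop bounds the achievable Fmax."""
    return min(regs / delay * 1000.0 for delay, regs in loops)

# A 6 ns accumulator loop with a single register caps Fmax near 167 MHz,
# no matter how deeply the paths outside the loop are pipelined.
print(loop_limit_mhz([(6.0, 1), (10.0, 4)]))
```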
Abstract:
Pipeline depth and cycle time are fixed early in the chip design process, but their impact can only be assessed when the implementation is mostly done and changing them is impractical. Elastic Systems are latency-insensitive systems that allow changes to the pipeline depth late in the design process with little design effort. Nevertheless, they incur a significant throughput penalty when new stages are added in the presence of pipeline loops. We propose Fluid Pipelines, an evolution that allows pipeline transformations without a throughput penalty. Formally, we introduce “or-causality” in addition to the already existing “and-causality” in Elastic Systems. This gives more flexibility than previously possible, at the cost of requiring the designer to specify the intended behavior of the circuit. On an Out-of-Order core benchmark, Fluid Pipelines improve the optimal energy-delay point, improving both performance (by 17%) and energy (by 13%). We envision a scenario where tools would be able to generate different pipeline configurations, e.g., low power or high performance, from the same RTL.
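The contrast between the two firing rules can be shown with a toy model (a deliberate simplification of elastic handshakes, not the paper's formalism): an "and-causality" join waits for tokens on every input, while an "or-causality" join may fire once the inputs the current operation actually needs are present.

```python
def and_join(ready):
    """Classic elastic join: fire only when every input has a token."""
    return all(ready.values())

def or_join(ready, needed):
    """Early-evaluation join: fire once the inputs actually used are ready."""
    return all(ready[i] for i in needed)

# A bypass mux selecting input "a" need not wait for "b":
ready = {"a": True, "b": False}
print(and_join(ready))                 # the classic join stalls
print(or_join(ready, needed={"a"}))    # the fluid pipeline proceeds
```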
Abstract:
Designers wait several hours to get synthesis, placement, and routing results even for small changes. Commercial FPGA flows allow for resynthesis after code changes; however, they target large code changes, and their incremental flows are not very effective for small edits. We propose SMatch, a flow for FPGAs with a novel incremental elaboration and novel incremental FPGA placement and routing that improves the state of the art by reducing the amount of placement and routing work needed. We evaluate our approach against commercial FPGA flows. Our method finishes synthesis, placement, and routing in under 30 s for most changes of publicly available benchmarks with negligible QoR impact, being over 20× faster than existing incremental FPGA flows.
CCS Concepts: • Hardware → Methodologies for EDA; Logic synthesis.
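The reuse idea behind such an incremental flow can be sketched as follows: cells whose netlist signature is unchanged keep their previous placement, so placement effort scales with the size of the edit rather than the design. This is an illustrative simplification; the data structures and the `place_fn` callback are hypothetical, not SMatch's actual algorithm:

```python
def incremental_place(old_place, old_sig, new_sig, place_fn):
    """Keep locations of unchanged cells; re-place only new/modified ones."""
    kept, dirty = {}, []
    for cell, sig in new_sig.items():
        if old_sig.get(cell) == sig and cell in old_place:
            kept[cell] = old_place[cell]     # unchanged: reuse old location
        else:
            dirty.append(cell)               # new or edited: needs placement
    kept.update(place_fn(dirty, fixed=kept)) # place only the dirty set
    return kept

old_sig = {"alu": "h1", "dec": "h2"}
old_place = {"alu": (0, 0), "dec": (1, 0)}
new_sig = {"alu": "h1", "dec": "h9", "fpu": "h3"}   # dec edited, fpu added
place = incremental_place(
    old_place, old_sig, new_sig,
    lambda cells, fixed: {c: (9, i) for i, c in enumerate(cells)})
print(place)   # alu keeps (0, 0); only dec and fpu were re-placed
```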
Abstract:
Currently, one of the major bottlenecks in digital design is synthesis. Each iteration of a design takes several hours to synthesize, putting pressure on designers to carefully consider when to submit jobs and then wait for the delayed feedback. This delay is especially important in FPGA emulation, where synthesis is performed frequently while fixing the system functionality. This work proposes LiveSynth, a different approach to digital design with relatively quick feedback after small, incremental changes. Our approach delivers results with close-to-optimal quality within a few seconds of processing time in most cases. LiveSynth was able to improve synthesis time by about 10× with minimal impact on QoR.
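One way to bound resynthesis work after a small edit is content hashing: only modules whose source hash changed, plus the parents that instantiate them, are resynthesized. This is a hypothetical sketch of change detection in that spirit, not LiveSynth's published algorithm:

```python
import hashlib

def dirty_modules(old_hashes, sources, parents):
    """Return the set of modules to resynthesize and the new hash table."""
    new_hashes = {m: hashlib.sha256(src.encode()).hexdigest()
                  for m, src in sources.items()}
    dirty = {m for m in new_hashes if old_hashes.get(m) != new_hashes[m]}
    # An edited module may change its port list, so its instantiating
    # parents are conservatively marked dirty as well.
    dirty |= {p for m in list(dirty) for p in parents.get(m, ())}
    return dirty, new_hashes
```

An unchanged design yields an empty dirty set, so repeated runs cost almost nothing; after a one-module edit, only that module and its parents re-enter synthesis.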
Abstract:
Since the discovery of Shor's algorithm, concern about quantum computation has increased. A large amount of research has been conducted to discover new algorithms and to build a quantum computer, but a general-purpose quantum computer still seems far from being achieved. Meanwhile, cryptographers around the world have started to look for security algorithms that resist quantum attacks, but these still need improvement to achieve practical execution times. This work proposes a quantum-classical hybrid architecture, focusing on photonic quantum computers. A small quantum coprocessor implementing Grover's search algorithm is used to search for roots of polynomials in F_(p^q). This coprocessor is used to accelerate the decoding process of the McEliece cryptosystem.
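As a rough illustration of the coprocessor's job, Grover's iteration can be simulated classically on a statevector: the oracle flips the sign of the amplitudes at roots of the polynomial, and the diffusion step reflects all amplitudes about their mean. The polynomial, the field size p = 17, and the iteration count below are toy choices, not the paper's parameters:

```python
import math

p = 17                                   # toy prime field F_17
def f(x):                                # toy polynomial x^2 + 3x + 2 over F_17
    return (x * x + 3 * x + 2) % p

roots = [x for x in range(p) if f(x) == 0]
amp = [1 / math.sqrt(p)] * p             # uniform superposition over F_17
iters = round(math.pi / 4 * math.sqrt(p / len(roots)))
for _ in range(iters):
    amp = [-a if f(x) == 0 else a for x, a in enumerate(amp)]  # oracle
    mean = sum(amp) / p
    amp = [2 * mean - a for a in amp]    # diffusion about the mean

best = max(range(p), key=lambda x: amp[x] ** 2)
print(best, f(best))                     # the most likely outcome is a root
```

After roughly (π/4)·√(N/M) iterations the probability mass concentrates on the M roots, which is the quadratic speedup over classical root search that the hybrid architecture exploits.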